ABSTRACT

A method and apparatus for issuing a non-blocking system
call to an I/O interface process, the non-blocking system call
identifying a portal from an application process, and polling
the portal to determine if an I/O request is complete, the I/O
interface process: polling an I/O device in response to the
non-blocking system call to determine if the I/O operation is
complete; and indicating that the I/O operation is complete
using the portal.

20 Claims, 6 Drawing Sheets

SYSTEM FOR PROVIDING ASYNCHRONOUS I/O OPERATIONS BY
IDENTIFYING AND POLLING A PORTAL FROM AN APPLICATION
PROCESS USING A TABLE OF ENTRIES CORRESPONDING TO I/O
OPERATIONS

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the field of massively
parallel computer systems; more particularly, the present
invention relates to a method and apparatus for using
multiple input/output (I/O) proxy processes in a massively
parallel computer.

2. Description of Related Art

Many scientific applications such as nuclear explosion
simulation, seismic exploration, and weather forecasting
require large quantities of processing power and are thus
ideal for massively parallel processor (MPP) computers.
MPP computers use a large number of processors (compute
nodes) to run one or more application processes and a large
number of processors (service and input/output (I/O) nodes)
to perform I/O services for the application processes.

A first operating system, such as the Cougar operating
system, is run on the compute nodes and a second operating
system, such as a version of the Open System Foundation
(OSF/1) operating system, is run on the service nodes and
the I/O nodes. The first operating system is typically a
light-weight operating system optimized for performance,
scalability, and availability in running the application
processes. In order to make this operating system lightweight,
the first operating system typically does not include any I/O
capability. In contrast, the second operating system is
typically capable of performing I/O services. I/O proxy
processes run under the second operating system as a proxy
for applications running under the first operating system.
These I/O proxy processes provide I/O services to the
applications.

The processing power of an MPP computer typically
scales efficiently with the number of compute nodes. The
volume of I/O requests typically increases as the number of
compute nodes increases.

As the number of I/O requests increases, the ability of the
I/O proxy processes to handle these I/O requests will
eventually become a bottleneck. This bottleneck limits the
ability of the I/O proxy processes to more completely utilize
the bandwidth of the I/O hardware.

A typical I/O request from application processes is
a blocking (synchronous) I/O request. When an I/O proxy
process receives a blocking I/O request, the I/O proxy
process is unavailable to other I/O requests until the
corresponding I/O operation is complete. By making the
I/O proxy process unavailable for periods of time, the
ability of the I/O proxy process to more completely utilize
the bandwidth of the I/O hardware is limited.

What is needed is a method and apparatus to more
completely exploit the bandwidth of I/O hardware in a
massively parallel processor (MPP) computer.

SUMMARY OF THE INVENTION

A method and apparatus for issuing a non-blocking system
call to an I/O interface process, the non-blocking system
call identifying a portal from an application process, and
polling the portal to determine if an I/O request is complete,
the I/O interface process: polling an I/O device in response
to the non-blocking system call to determine if the I/O
operation is complete; and indicating that the I/O operation
is complete using the portal.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates one embodiment of a computer system
of the present invention.

FIG. 2 illustrates one embodiment of the method of
dynamically creating I/O proxy processes in response to
selected events in the computer system of FIG. 1.

FIG. 3 illustrates another embodiment of the method of
dynamically creating I/O proxy processes in response to
selected events in the computer system of FIG. 1.

FIG. 4 illustrates another embodiment of a computer
system of the present invention.

FIG. 5 illustrates one embodiment of the method of
handling I/O requests in the computer system of FIG. 4.

FIG. 6 illustrates one embodiment of the method of
handling I/O requests in a computer system of FIG. 4.

DETAILED DESCRIPTION

The present invention is a method and apparatus to more
completely exploit the bandwidth of input/output (I/O)
hardware in a massively parallel processor (MPP) computer.
In one embodiment, an MPP computer includes a set of
compute nodes (running an application process and a library
process), a set of service nodes (running one or more I/O
proxy processes), and an I/O subsystem that includes a file
server. Each I/O proxy process receives I/O calls from one
or more of the application processes through the
corresponding library process. Each I/O proxy process
interacts with the file system to process I/O calls.

One aspect of the invention is a method and apparatus for
dynamically creating I/O proxy processes (running on the
service nodes) in response to certain events to more
efficiently use computer resources. One event may be an I/O
request to open a file when the current I/O proxy process
has reached its limit of open files. Another event may be the
application process beginning to use a new set of compute
nodes that has no I/O proxy processes assigned. Yet another
event may be a user-request to generate more I/O proxy
processes. Other events may be used to trigger the dynamic
creation of I/O proxy processes. By dynamically creating I/O
proxy processes, the number of I/O proxy processes may be
controlled to more completely exploit the bandwidth of the
I/O hardware.

Another aspect of the invention is a method and apparatus
for providing non-blocking (asynchronous) I/O calls to the
I/O proxy process. In one embodiment, a library process
(running on the compute nodes) transparently translates a
blocking (synchronous) I/O call from an application process
(running on the compute nodes) to a non-blocking I/O call
issued to an I/O proxy process (running on the service
nodes). Since the I/O proxy process receives a non-blocking
I/O call, it is not blocked and is therefore available to process
other I/O calls while waiting for the non-blocking I/O call to
complete. By increasing the availability of the I/O proxy
processes, the I/O proxy processes are able to more
completely exploit the bandwidth of the I/O hardware.

In one embodiment, the non-blocking I/O call includes a
portal. A portal includes a pointer to the address space of the
issuing process (in this case the application process) so that
information can be transferred directly to the issuing
process. A portal key may be provided with the portal such that
access through the portal is only provided to I/O requests
that include the portal key. Alternatively, a portal key is not
provided and all I/O requests may access the portal. In one
embodiment, the portal is managed by the operating system
in such a way that it is transparent to the issuing process.
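
As an illustration only, the keyed access described above can be sketched in Python. The class name `Portal` and the method `deliver` are hypothetical and do not appear in this specification; the sketch models the portal as a writable status slot owned by the issuing process and assumes a simple equality check on the key.

```python
class Portal:
    """Hypothetical model of a portal: a slot in the issuer's address space."""

    def __init__(self, key=None):
        self._key = key      # None models the embodiment with no portal key
        self.status = None   # filled in by whoever writes through the portal

    def deliver(self, status, key=None):
        """Write status through the portal; refuse a request with the wrong key."""
        if self._key is not None and key != self._key:
            return False     # request did not include the portal key
        self.status = status
        return True
```

With `Portal(key=42)`, a delivery that presents a different key is rejected and the status slot is left untouched; with `Portal()`, any request may write through the portal.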

By using a portal, the I/O proxy process can provide the
status of the non-blocking I/O call to the library process
through the portal when the I/O operation is completed,
rather than the library process (on a set of compute nodes)
repeatedly issuing non-blocking I/O calls to the I/O proxy
process (on the service nodes) to check the status of the I/O
operation until the I/O operation is completed. By reducing
the use of communication cycles between the set of compute
nodes and the service nodes, more of the bandwidth of the
I/O subsystem is made available for other I/O operations.
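
The saving described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the invention: `StatusPortal`, `blocking_io`, and `fake_proxy` are hypothetical names, and a timer thread stands in for the I/O proxy process on the service nodes. The library side sends a single non-blocking request carrying the portal and then polls only its local portal.

```python
import threading
import time


class StatusPortal:
    """Hypothetical local status slot written by the proxy on completion."""

    def __init__(self):
        self.status = None


def blocking_io(issue_nonblocking, poll_interval=0.001):
    """Present a blocking call to the caller using one non-blocking request."""
    portal = StatusPortal()
    issue_nonblocking(portal)        # the only message sent to the proxy
    polls = 0
    while portal.status is None:     # local polling, no further messages
        polls += 1
        time.sleep(poll_interval)
    return portal.status, polls


def fake_proxy(portal, delay=0.01):
    """Stand-in proxy: completes later and writes the status into the portal."""
    def complete():
        portal.status = "complete"
    threading.Timer(delay, complete).start()
```

Calling `blocking_io(fake_proxy)` returns once the stand-in proxy has written the status; however many times the loop polls, only one request crosses to the proxy.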

Although each of these aspects of the invention may be
practiced independently, one implementation employs both
aspects of the invention.

FIG. 1 illustrates one embodiment of a computer system 
of the present invention. The computer system includes a set 
of compute nodes 150 including a set of compute nodes 100 
and a set of compute nodes 110, a set of service nodes 170 
including a set of service nodes 175 and a set of service 
nodes 180, and a file server 140. The set of service nodes 175
is coupled to the set of compute nodes 100 via a 2 dimen- 
sional (2D) bidirectional mesh 160 and coupled to the file 
server 140 via a 2D directional mesh 162. The set of service 
nodes 180 is coupled to the set of compute nodes 110 via a 
2D directional mesh 161 and coupled to the file server 140 
via a 2D directional mesh 163. In one embodiment, the 2D 
directional mesh 160, the 2D directional mesh 161, the 2D 
directional mesh 162 and the 2D directional mesh 163 are 
part of the same 2D bidirectional mesh interconnecting the 
compute nodes 150, the service nodes 170 and the I/O nodes 
(not shown). However, the present invention may be prac- 
ticed with other interconnect configurations. 

FIG. 1 illustrates an application process run on the set of
compute nodes 150 and a set of I/O proxy processes 121
running on a set of service nodes 175 and a set of I/O proxy
processes 131 running on a set of service nodes 180. The set
of compute nodes 100 and the set of compute nodes 110 do
not necessarily indicate a physical partition of the set of
compute nodes 150. In one embodiment, the number of
compute nodes and the particular compute nodes included in
each of the sets of compute nodes is determined by software
control.
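
One way such software control might map a compute node to a set of I/O proxy processes is sketched below in Python. The function name `proxy_set_for_node` is hypothetical, and the contiguous, equal-sized split is an assumption taken from the 256-node example given with reference to FIG. 2 (nodes 0 through 63 served by the first set, nodes 64 through 127 by the second, and so on).

```python
def proxy_set_for_node(node_id, total_nodes, num_sets):
    """Return the index of the proxy set serving a compute node.

    Assumes contiguous blocks of compute nodes, one block per proxy set.
    """
    if not 0 <= node_id < total_nodes:
        raise ValueError("node_id out of range")
    nodes_per_set = -(-total_nodes // num_sets)   # ceiling division
    return node_id // nodes_per_set
```

For 256 compute nodes and four sets, node 63 maps to set 0 and node 64 maps to set 1; for two sets, the split falls between nodes 127 and 128.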

When a compute node in the set of compute nodes 100
generates an I/O call, the I/O call is directed to one of the set
of I/O proxy processes 121 through the 2D directional mesh
160. In one example, the I/O call is a request to open a file
and the I/O call is directed to a current I/O proxy process 120
of the set of I/O proxy processes 121. In another example,
the I/O call is a write or read operation to a particular file and
the I/O call is directed towards the I/O proxy process that
opened that particular file, the I/O proxy process being in the
set of I/O proxy processes 121. The I/O proxy process that
receives the I/O call issues a corresponding I/O call to an
emulation library (not shown). An emulation library
interprets I/O calls and interfaces with the file system to process
these I/O calls. Here, the emulation library interacts with an
I/O server 140 via the 2D directional mesh 162 to process
the I/O call corresponding to the I/O request. In one
embodiment, an emulation library is dynamically linked to
each I/O proxy process. In one embodiment, a file server
protocol, such as the parallel file system (PFS), is
implemented in the emulation library.

When a compute node in the set of compute nodes 110
generates an I/O call, the I/O call is directed to one of the set
of I/O proxy processes 131 through the 2D directional mesh
161. In one example, the I/O call is a request to open a file
and the I/O call is directed to a current I/O proxy process 130
in the set of I/O proxy processes 131. In another example,
the I/O call is a write or read operation to a particular file and
the I/O call is directed towards the I/O proxy process that
opened that particular file, the I/O proxy process being in the
set of I/O proxy processes 131. The I/O proxy process that
receives the I/O call issues a corresponding I/O call to an
emulation library (not shown). The emulation library
interacts with an I/O server 140 on the 2D directional mesh 163
to process the I/O request corresponding to the I/O call
received by the emulation library.

If the number of compute nodes in the set of compute
nodes 100 is increased for a fixed number of I/O proxy
processes in the set of I/O proxy processes 121, the set of I/O
proxy processes 121 may not be able to efficiently handle the
volume of I/O calls generated by the set of compute nodes
100. In prior art computer systems, if the number of files
opened by the set of compute nodes 100 reaches the limit of
file descriptors for an I/O proxy process, subsequent
requests to open files cause the I/O proxy process to close
at least one of the open files to free a file descriptor. Opening
and closing files increases the number of system calls and
each of these system calls typically uses a context switch.
The communication cycles associated with context switches
degrade performance. In addition, a standard UNIX process
has a limit of 64 file descriptors. This limits the number of
open files that a single I/O proxy process in the set of I/O
proxy processes 121 can manage for the compute nodes 100.
It is not unusual for an I/O proxy process to service hundreds
of compute nodes. A limit of 64 open files for 200 compute
nodes, for example, can lead to performance degradation for
the reasons described above.

The present invention provides for the dynamic creation
of additional I/O proxy processes in a set of I/O proxy
processes in response to an event. By allowing for the
dynamic creation of I/O proxy processes when the number
of open files is at the limit available to the running I/O proxy
processes, the performance degradation associated with
closing files to make file descriptors available is avoided. In
one embodiment, an enhanced version of UNIX is used to
provide more file descriptors so that more files may be
opened by each I/O proxy process. In addition, the dynamic
creation of I/O proxy processes when additional compute
nodes are used by the application process allows I/O proxy
processes to be adjusted in response to processing
conditions.

FIG. 2 illustrates one embodiment of the method of
dynamically creating I/O proxy processes in response to
selected events. The method is described with reference to
FIG. 1. In this embodiment, the method is implemented
using a control process.

In step 200, a control process receives a user-request to
create N I/O proxy processes. For example, a user may
request that 2 sets of I/O proxy processes are created when
starting an application process. Alternatively, the user may
request an additional two I/O proxy processes be created for
an application process that is already running.

In step 210, a control process dynamically creates two sets
of I/O proxy processes, each of the I/O proxy processes
corresponding to a set of compute nodes. For example, if
there are 256 compute nodes in the compute nodes 150, the
set of I/O proxy processes 121 is assigned to the 128



compute nodes in the compute nodes 100 and the set of I/O
proxy processes 131 is assigned to the 128 compute nodes
in the compute nodes 110. Alternatively, the two sets of I/O
proxy processes already have been created and an additional
I/O proxy process is created for each set of I/O proxy
processes in response to a user-request that an additional two
I/O proxy processes be created.

In another example, there are 256 compute nodes in the
compute nodes 150 and the user requests four I/O proxy
processes. Then each of the I/O proxy processes is assigned
to 64 compute nodes. Thus, I/O requests from compute
nodes 0, 1, 2 . . . 63 are serviced through the first set of I/O
proxy processes, I/O requests from compute nodes 64, 65,
66 . . . 127 are serviced through the second set of I/O proxy
processes, etc.

In one embodiment, only a single I/O proxy process is
initially created for each set of I/O proxy processes.
Alternatively, two or more I/O proxy processes are initially
created for each set of I/O proxy processes. In either case,
additional I/O proxy processes may be dynamically created
as described with reference to FIG. 3.

FIG. 3 illustrates another embodiment of the method of
dynamically creating I/O proxy processes in response to
selected events. The method is described with reference to
FIG. 1. In one embodiment, the method is implemented
using a control process.

In step 300, the control process determines whether a
request to open a file has been made. If a request to open a
file has been made, the control process performs step 310.
Otherwise, the control process performs step 350.

In step 310, one of the sets of I/O proxy processes receives
the open file request. Which of the multiple sets of I/O proxy
processes receives the open file request depends on which of
the sets of compute nodes the open file request is from. For
example, in the configuration illustrated in FIG. 1, the set of
I/O proxy processes 121 receives open file requests from the
set of compute nodes 100 and the set of I/O proxy processes
131 receives open file requests from the set of compute
nodes 110.

In step 320, for the set of I/O processes to which the open
file request is directed, the control process determines
whether the number of open files in the current I/O proxy
process is equal to N, where N is the open file limit of a
process under that operating system. If the number of open
files in the current I/O proxy process is equal to N, the
control process performs step 330. Otherwise, the control
process performs step 340. For example, if the control
process had determined that the open file request had come
from the set of compute nodes 110 in step 310, the control
process would determine whether the number of open files
in the current I/O proxy process 130 is equal to N.

In one embodiment, the operating system is a standard
version of UNIX and N is 64. Alternatively, the operating
system is an enhanced version of UNIX and N is 2048. In
one embodiment, the number of open files available in the
enhanced version of UNIX is achieved by providing 64 bit
operands and an 11 bit file identification field in the 64 bit
operand. Other implementations, such as those that use
different size operands and file identification fields, may be
used. Other values of N may be used.

In step 330, the control process dynamically creates a new
I/O proxy process by cloning the current I/O proxy process
for that set of compute nodes. For example, if the control
process had determined that the number of open files in the
current I/O proxy process 130 is equal to N, the control
process clones the current I/O proxy process 130 to create a
new I/O proxy process in the set of I/O proxy processes 131.
The new I/O proxy process becomes the current I/O proxy
process in the set of I/O proxy processes 131. Subsequent
file open requests are now routed to the new I/O proxy
process. Step 350 is performed.

In step 340, the control process opens a file in the current
I/O proxy process. For example, if the control process had
determined that the number of open files in the current I/O
proxy process 130 is not equal to N, the control process
would open a new file in the current I/O proxy process 130.
Step 350 is performed.

In step 350, the control process determines whether a
request to read or write to a file has been made. If a request
to read or write to a file has been made, the control process
performs step 360. Otherwise, the control process performs
step 380.

In step 360, one of the sets of I/O proxy processes receives
the read or write request. Which of the multiple sets of I/O
proxy processes receives the read or write request depends
on which of the sets of compute nodes the read or write
request is from. For example, in the configuration illustrated
in FIG. 1, the set of I/O proxy processes 121 receives read
or write requests from the set of compute nodes 100 and the
set of I/O proxy processes 131 receives read or write
requests from the set of compute nodes 110.

In step 370, the I/O proxy process in the set of I/O proxy
processes that opened the file to which the read or write is
directed processes the read or write request. For example, in
the configuration illustrated in FIG. 1, a first I/O proxy
process of the set of I/O proxy processes 131 receives a read
request to a first file, if the first I/O proxy process had
opened that first file. Step 380 is performed.

In step 380, the control process determines if the
application process has begun to use a compute node from a set
of compute nodes for which there is not a set of I/O proxy
processes. If the application process has begun to use a node
from a new set of compute nodes, the control process
performs step 390. Otherwise, the control process performs
step 300.

In step 390, the control process dynamically creates a new
I/O proxy process for the new set of compute nodes. Any file
open requests and other I/O calls from this set of compute
nodes are then processed by the current I/O proxy process
for the set of I/O proxy processes 130.

Dynamically creating I/O proxy processes in response to
certain events allows computer resources to be more
efficiently used. In one embodiment, only one I/O proxy
process is run on each service node. Alternatively, two or
more I/O proxy processes are run on each service node.
Generally, as the number of I/O proxy processes running on
each node increases, fewer computer resources are allocated
to each I/O proxy process. In one embodiment, a
round-robin method of distributing newly created I/O proxy
processes on the set of service nodes may be used to equally
distribute the load on the set of service nodes. However,
other methods may distribute the newly created I/O proxy
processes on the service nodes with consideration for other
factors such as the relative load on each I/O proxy process
and the performance of each particular service node in the
set of service nodes.

In one embodiment, the computer system is a distributed
memory, Multiple Instruction Multiple Data (MIMD)
message passing machine having scalable communication
bandwidth, scalable main memory, scalable internal disk
storage capacity, and scalable I/O. One such computer is the
Intel Teraflops (TFLOPS) Computer. One implementation
includes 4,500 compute nodes each containing 2 Intel
Pentium® Pro processors coupled together via 2D directional
mesh interconnect having a bandwidth of 400 megabytes/
second (MB/s), 32 service nodes, 40 I/O nodes coupled to 34
redundant arrays of inexpensive disks (RAIDs) each storing
32 gigabytes (GB), two 1 terabyte (TB) RAID storage
systems, and 600 GB of main memory to derive 1.8 teraflops
(peak) performance. The compute nodes run the Cougar
operating system consisting of a Quintessential Kernel
(Q-Kernel), a Process Control Thread (PCT), and utilities such
as yod and fyod. The service nodes and the I/O nodes run a
version of the Open System Foundation (OSF/1) operating
system.

In one embodiment, the control process is a yod and each
I/O proxy process is an fyod. The yod is an OSF/1 utility that
runs on one of the service nodes, and controls the application
process on the compute nodes 150 including reading the
application executable file, obtaining the compute nodes 150
to run the application, transmitting an application executable
file to the compute nodes 150, and starting the execution of the
application executable file on the compute nodes 150. All the
UNIX system calls from the application process are directed
to the yod. An fyod is an interface between the application
process and the I/O subsystem. All the I/O requests from the
application process are directed to an fyod. Some fyods may
be started by the yod as a child process before it starts the
application (statically). Other fyods may be started in
response to certain events according to the method described
above (dynamically).

It will be apparent to one skilled in the art that numerous
computer hardware and software configurations may be used
consistent with the spirit and scope of the present invention.

FIG. 4 illustrates one embodiment of a computer system
of the present invention.

The computer system includes a set of compute nodes, a
set of service nodes and a file server including an I/O node
and an I/O device. Although a single I/O node and a single
I/O device is shown, it will be apparent to one skilled in the
art that the present invention may be practiced with multiple
I/O nodes each having one or more I/O devices.

The I/O device(s) may include any device capable of
transferring information to a local or a remote location. For
example, the I/O device(s) may include a RAID, a hard disk
drive, a compact disk read-only memory (CD-ROM) drive,
a floppy disk drive, a tape drive, or a network device (capable
of interfacing to a local area network, for example). In one
embodiment, the I/O device is capable of reading and/or
writing to a computer readable medium 420. The computer
readable medium 420 may be a floppy disk, CD-ROM, or a
tape cartridge, for example. The computer readable medium
420 may be a carrier wave such that information is contained
in a signal that is superimposed on the carrier wave. In one
embodiment, the computer readable medium 420 contains
instructions which, when executed on a computer system,
perform an embodiment of a method described herein.

An application process and a library process are run on
one or more of the compute nodes. An I/O proxy process and
an emulation library are run on one or more of the service
nodes. In one embodiment, one or more I/O proxy processes
(in one or more sets of I/O proxy processes) are run on the
set of service nodes. In one embodiment, at least some of
these I/O proxy processes are dynamically created as
described with reference to FIGS. 1 and 2. In another
embodiment, one or more I/O proxy processes are statically
generated.

FIG. 5 illustrates one embodiment of the method of
handling I/O requests in a computer system. The method is
described with reference to the computer system of FIG. 4.
The compute nodes are running a library process and an
application process. The service nodes are running at least
one I/O proxy process and an emulation library. The I/O
node is running a server process.

In step 500, the library process receives a blocking I/O
call from the application process. A blocking I/O call
typically halts availability of the receiving I/O proxy process to
other I/O requests until the I/O request corresponding to the
blocking I/O call is complete. For example, the application
may issue a blocking I/O call known as iowait( ) to
determine if a previous non-blocking I/O request, iwrite( ), has
completed.

In step 510, the library process issues a non-blocking I/O
call to the I/O proxy process. The non-blocking I/O call
corresponds to the blocking I/O call and is issued with a
portal that includes a pointer and a portal key. For example,
the library process receives a blocking iowait( ) call and
issues a non-blocking iodone( ) call. The non-blocking call
includes a status portal. The status portal has a pointer to the
application process and a portal key. In one embodiment, the
status portal is managed by the operating system on the
compute nodes in such a way that it is transparent to the
library process and the application process.

In step 520, the library process polls the status portal to
determine if the I/O operation is complete. Since the status
portal is available locally, polling the status portal does not
use bandwidth between the service nodes and the compute
nodes. This leaves more bandwidth available to other
operations.

In step 530, the library process determines whether the
I/O operation is complete. If the I/O operation is complete,
the library process performs step 540. Otherwise, the library
process performs step 520.

In step 540, the library process indicates to the application
process that the I/O request is complete by sending the I/O
status 413 to the application process. In one embodiment,
the application process is blocked until it receives the I/O
status and the translation of the blocking call to a
non-blocking call by the library is transparent to the application
process.

In one embodiment, the library process issues a blocking
system call instead of performing steps 520 and 530. In one
example, the library process receives a blocking I/O call,
such as a cwrite( ) call, and issues a corresponding
non-blocking I/O call, such as an iwrite( ) call, followed by a
blocking I/O call, such as an iowait( ) call, to determine the
status of the iwrite( ) call.

FIG. 6 illustrates one embodiment of the method of
handling I/O requests in a computer system. The method is
described with reference to the computer system of FIG. 4.
The compute nodes are running a library process and an
application process. The service nodes are running at least
one I/O proxy process and an emulation library. In one
embodiment, the emulation library is dynamically linked to
the I/O proxy process. The I/O node is running a server
process.

In step 600, the I/O proxy process receives a non-blocking
I/O call from the library process. The non-blocking call
includes a portal pointer and a portal key for a status portal.

In step 610, the I/O proxy process stores the portal pointer
and the portal key in an entry in an I/O table that stores
pending (outstanding) I/O calls. The entry corresponds to an
I/O operation.

In step 620, the I/O proxy process accesses a valid entry
in the I/O table that corresponds to an outstanding
non-blocking I/O call.
In step 630, the I/O proxy process issues a non-blocking
system call to an emulation library. The non-blocking system
call corresponds to the non-blocking I/O call. The emulation
library polls the server process to determine if the I/O
operation corresponding to that non-blocking I/O call is
complete. The server process determines if the
corresponding I/O operation has completed and returns an I/O status
410 to the emulation library.

In step 640, the I/O proxy process determines if the I/O
operation is complete. If the I/O operation is not complete, the
I/O proxy process performs step 620. If the I/O operation is
complete, the I/O proxy process performs step 650.

In step 650, the I/O proxy process sends the I/O status 412 
through the portal using the portal pointer and the portal key. 
The I/O status 412 indicates that the I/O operation corre- 
sponding to that non-blocking I/O call is complete. 

In step 660, the I/O proxy process invalidates (or deletes) 
the entry of the I/O table that corresponds to the completed 
I/O request since it is no longer outstanding. The I/O proxy 
process then performs step 620. 

In one embodiment, the I/O proxy process periodically
accesses each valid entry in the I/O table such that it
monitors all outstanding I/O operations. Thus, each time the
I/O proxy process accesses the entry corresponding to a
non-blocking I/O call in the I/O table, the I/O proxy process
issues a non-blocking system call to the emulation library
and receives an I/O status 411.

The I/O proxy process monitors the status of the
outstanding I/O operations by polling an emulation library that is
local to the service nodes. This avoids using communication
cycles between the service nodes, the I/O node, and the
compute nodes.
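
The table-driven loop of steps 600 through 660 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names `IOProxy`, `Portal`, `receive`, `poll_once`, and `query_status` are hypothetical, a dictionary stands in for the I/O table, and `query_status` stands in for the non-blocking system call to the emulation library, returning `None` while the operation is still outstanding.

```python
class Portal:
    """Hypothetical status portal protected by a key."""

    def __init__(self, key):
        self._key = key
        self.status = None

    def deliver(self, status, key):
        if key == self._key:
            self.status = status


class IOProxy:
    def __init__(self, query_status):
        self.query_status = query_status  # stand-in for the emulation library call
        self.table = {}                   # I/O table: op_id -> (portal, key)

    def receive(self, op_id, portal, key):
        """Steps 600-610: record the portal pointer and key for a pending call."""
        self.table[op_id] = (portal, key)

    def poll_once(self):
        """Steps 620-660, applied once to every valid entry."""
        for op_id in list(self.table):
            status = self.query_status(op_id)   # step 630
            if status is None:                  # step 640: still outstanding
                continue
            portal, key = self.table[op_id]
            portal.deliver(status, key)         # step 650
            del self.table[op_id]               # step 660: invalidate the entry
```

Because `poll_once` never waits on any single entry, the proxy in this sketch stays available to accept new calls through `receive` between passes over the table.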

It will be apparent to one skilled in the art that the present
invention may be practiced with multiple file servers. In one
embodiment, the method and apparatus to dynamically
create I/O proxy processes is used in conjunction with the
method and apparatus to perform non-blocking I/O calls.
Alternatively, these inventions may be practiced
independently.

In one embodiment, data is transferred to the application 
process directly from the server process through a data 
portal. One example of such a file server is Intel's parallel 
file server (PFS). Alternatively, the data is transferred to the 
application process through the I/O proxy process. This is 
typically how non-PFS file servers operate. 

What is claimed is: 

1. A method comprising: 

issuing a non-blocking system call to an I/O interface 
process, the non-blocking system call identifying a 
portal from an application process; and 

polling the portal to determine if an I/O request is 
complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is com- 
plete; and 

indicating that the I/O operation is complete using the 
portal; wherein the step of polling an I/O device in 
response to the non-blocking system call comprises: 

storing a first entry corresponding to the I/O operation 
into a table, the table containing a plurality of entries 
each corresponding to one of a plurality of I/O opera- 
tions; 

accessing the first entry; and 

determining whether the I/O operation corresponding to 
the first entry is complete. 



2. The method of claim 1 wherein the first entry comprises 
a pointer to the portal. 

3. The method of claim 1 wherein the first entry comprises 
a key corresponding to the portal, the key being used to 
access the portal. 

4. The method of claim 1 wherein the I/O operation is a 
non-blocking write operation. 

5. The method of claim 1 wherein the I/O operation is a 
non-blocking read operation. 

6. A method comprising: 

receiving a blocking system call from an application 
process; 

issuing a non-blocking system call to an I/O interface 
process in response to receiving the blocking system 
call, the non-blocking system call identifying a portal 
from the application process; 

polling the portal to determine if an I/O request is com- 
plete; and 

indicating that the I/O request is complete to the appli- 
cation process when the step of polling the portal to 
determine if the I/O request is complete determines that 
the I/O request is complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is com- 
plete; and 

indicating that the I/O operation is complete using the 
portal. 

7. The method of claim 6 wherein the step of issuing a 
non-blocking system call to an I/O interface process in 
response to receiving the blocking system call is transparent 
to the application process. 

8. A method comprising: 

receiving a blocking I/O operation; 

in response to receiving the blocking I/O operation: 

issuing the I/O operation, the I/O operation being a 
non-blocking I/O operation; and 

issuing a non-blocking I/O system call to an I/O interface 
process, the non-blocking system call identifying a 
portal from an application process; and 

polling the portal to determine if an I/O request is 
complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is complete; 
and 

indicating that the I/O operation is complete using the 
portal. 

9. The method of claim 8 wherein the blocking I/O 
operation is a blocking write operation and the I/O operation 
is a non-blocking write operation. 

10. The method of claim 8 wherein the blocking I/O 
operation is a blocking read operation and the I/O operation 
is a non-blocking read operation. 

11. A machine readable medium having embodied therein 
a program which when executed by a machine performs a 
method comprising of: 

issuing a non-blocking system call to an I/O interface 
process, the non-blocking system call identifying a 
portal from an application process; and 

polling the portal to determine if an I/O request is 
complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is com- 
plete; and 

indicating that the I/O operation is complete using the 
portal; wherein the step of polling an I/O device in 
response to the non-blocking system call comprises: 

storing a first entry corresponding to the I/O operation 
into a table, the table containing a plurality of entries 
each corresponding to one of a plurality of I/O operations; 

accessing the first entry; and 

determining whether the I/O operation corresponding to 
the first entry is complete. 

12. The machine readable medium of claim 11 wherein 
the first entry comprises a pointer to the portal. 

13. The machine readable medium of claim 11 wherein 
the first entry comprises a key corresponding to the portal, 
the key being used to access the portal. 

14. The machine readable medium of claim 11 wherein 
the I/O operation is a non-blocking write operation. 

15. The machine readable medium of claim 11 wherein 
the I/O operation is a non-blocking read operation. 

16. A machine readable medium having embodied therein 
a program which when executed by a machine performs a 
method comprising of: 

receiving a blocking system call from an application 
process; 

issuing a non-blocking system call to an I/O interface 
process in response to receiving the blocking system 
call, the non-blocking system call identifying a portal 
from the application process; 

polling the portal to determine if an I/O request is complete; 
and 

indicating that the I/O request is complete to the application 
process when the step of polling the portal to 
determine if the I/O request is complete determines that 
the I/O request is complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is complete; 
and 

indicating that the I/O operation is complete using the 
portal. 

17. The machine readable medium of claim 16 wherein 
the step of issuing a non-blocking system call to an I/O 
interface process in response to receiving the blocking 
system call is transparent to the application process. 

18. A machine readable medium having embodied therein 
a program which when executed by a machine performs a 
method comprising of: 

receiving a blocking I/O operation; 

in response to receiving the blocking I/O operation: 

issuing the I/O operation, the I/O operation being a 
non-blocking I/O operation; and 

issuing a non-blocking I/O system call to an I/O interface 
process, the non-blocking system call identifying a 
portal from an application process; and 

polling the portal to determine if an I/O request is 
complete, the I/O interface process: 

polling an I/O device in response to the non-blocking 
system call to determine if the I/O operation is com- 
plete; and 

indicating that the I/O operation is complete using the 
portal. 

19. The machine readable medium of claim 18 wherein 
the blocking I/O operation is a blocking write operation and 
the I/O operation is a non-blocking write operation. 

20. The machine readable medium of claim 18 wherein 
the blocking I/O operation is a blocking read operation and 
the I/O operation is a non-blocking read operation. 